LiteMat: a scalable, cost-efficient inference encoding scheme for large RDF graphs
The number of linked data sources and the size of the linked open data graph
keep growing every day. As a consequence, semantic RDF services are more and
more confronted with various "big data" problems. Query processing in the
presence of inferences is one of them. For instance, to complete the answer set of
SPARQL queries, RDF database systems evaluate semantic RDFS relationships
(subPropertyOf, subClassOf) through time-consuming query rewriting algorithms
or space-consuming data materialization solutions. To reduce the memory
footprint and ease the exchange of large datasets, these systems generally
apply a dictionary approach for compressing triple data sizes by replacing
resource identifiers (IRIs), blank nodes and literals with integer values. In
this article, we present a structured resource identification scheme using a
clever encoding of concepts and property hierarchies for efficiently evaluating
the main common RDFS entailment rules while minimizing triple materialization
and query rewriting. We show how this encoding can be computed by a
scalable parallel algorithm and implemented directly over the Apache Spark
framework. The efficiency of our encoding scheme is emphasized by an evaluation
conducted over both synthetic and real-world datasets.
Comment: 8 pages, 1 figure
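The core idea of such hierarchy-aware encodings can be sketched in a few lines. The snippet below is an illustrative simplification, not the paper's exact scheme: every class identifier embeds its superclass's bit prefix, so a transitive subClassOf test reduces to a prefix comparison, with no query rewriting and no materialized inferred triples. All names and the bit layout are our own assumptions.

```python
# Illustrative prefix encoding of a class hierarchy (not LiteMat's exact
# layout): each class ID extends its parent's bit string, so subClassOf
# entailment becomes a simple prefix check.

def encode_hierarchy(root, children, bits_per_level=3):
    """Assign each class a binary-prefix identifier via DFS."""
    ids = {root: ""}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(children.get(node, []), start=1):
            # Each child extends its parent's prefix with a local slot.
            ids[child] = ids[node] + format(i, f"0{bits_per_level}b")
            stack.append(child)
    return ids

def is_subclass_of(ids, sub, sup):
    # sub is (transitively) a subclass of sup iff sup's code prefixes sub's.
    return ids[sub].startswith(ids[sup])

hierarchy = {"Thing": ["Agent", "Place"], "Agent": ["Person", "Organization"]}
ids = encode_hierarchy("Thing", hierarchy)
print(is_subclass_of(ids, "Person", "Agent"))  # True
print(is_subclass_of(ids, "Place", "Agent"))   # False
```

With such an encoding, evaluating an RDFS-entailed triple pattern amounts to a range or prefix test over integer identifiers rather than expanding the query into a union over all subclasses.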
On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark
Querying very large RDF data sets in an efficient manner requires a
sophisticated distribution strategy. Several innovative solutions have recently
been proposed for optimizing data distribution with predefined query workloads.
This paper presents an in-depth analysis and experimental comparison of five
representative and complementary distribution approaches. To achieve fair
experimental results, we use Apache Spark as a common parallel computing
framework, rewriting the concerned algorithms using the Spark API. Spark
provides guarantees in terms of fault tolerance, high availability and
scalability which are essential in such systems. Our different implementations
aim to highlight the fundamental implementation-independent characteristics of
each approach in terms of data preparation, load balancing, data replication
and, to some extent, query answering cost and performance. The presented
measures are obtained by testing each system on one synthetic and one
real-world data set over query workloads with differing characteristics and
different partitioning constraints.
Comment: 16 pages, 3 figures
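One representative strategy compared in such studies can be sketched compactly. The code below is our own illustration, not taken from the paper: hash partitioning of triples by subject, which places all triples of a given subject on the same worker so that subject-centric "star" joins execute without shuffling. This mirrors what a keyed `partitionBy` achieves on a Spark RDD.

```python
# Illustrative sketch (names are ours): hash-partition RDF triples by
# subject so that subject-star joins are local to one worker.

def hash_partition(triples, num_workers):
    """Assign each (s, p, o) triple to a worker by hashing its subject."""
    partitions = [[] for _ in range(num_workers)]
    for s, p, o in triples:
        partitions[hash(s) % num_workers].append((s, p, o))
    return partitions

parts = hash_partition([("s1", "p", "o1"), ("s1", "q", "o2"),
                        ("s2", "p", "o3")], num_workers=4)
# All triples with subject "s1" land in the same partition.
```

The trade-offs the paper measures follow directly from such choices: subject hashing balances load well but may replicate nothing, while workload-aware schemes replicate data to localize more query shapes at a higher preparation cost.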
ATEM: A Topic Evolution Model for the Detection of Emerging Topics in Scientific Archives
This paper presents ATEM, a novel framework for studying topic evolution in
scientific archives. ATEM is based on dynamic topic modeling and dynamic graph
embedding techniques that explore the dynamics of content and citations of
documents within a scientific corpus. ATEM explores a new notion of contextual
emergence for the discovery of emerging interdisciplinary research topics based
on the dynamics of citation links in topic clusters. Our experiments show that
ATEM can efficiently detect emerging cross-disciplinary topics within the DBLP
archive of over five million computer science articles.
On Distributed SPARQL Query Processing Using Triangles of RDF Triples
Knowledge Graphs provide valuable functionalities, such as data integration and reasoning, to an increasing number of applications in all kinds of companies. These applications partly depend on the efficiency of a Knowledge Graph management system, which is often based on the RDF data model and queried with SPARQL. In this context, query performance is paramount and relies on an optimizer that usually makes intensive use of a large set of indexes. Generally, these indexes correspond to different re-orderings of the subject, predicate and object of a triple pattern. In this work, we present a novel approach that considers indexes formed by a frequently encountered basic graph pattern: the triangle of triples. We propose dedicated data structures to store these triangles, provide distributed algorithms to discover and materialize them, including inferred triangles, and detail query optimization techniques, including a data partitioning approach for skewed data. We provide an implementation that runs on top of Apache Spark and experiment on two real-world RDF data sets. This evaluation emphasizes the performance boost (up to 40x on query processing) that can be obtained by using our approach when facing triangles of triples.
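The triangle pattern at the heart of this approach is easy to state concretely. The sketch below is a minimal, non-distributed illustration (function and variable names are ours; predicates are ignored for brevity): it enumerates node triples {a, b, c} whose edges close a cycle, which is the shape the paper's distributed algorithms discover and materialize at scale, inferred triples included.

```python
from collections import defaultdict

# Illustrative in-memory triangle discovery over an RDF edge set.

def find_triangles(triples):
    """Return node sets {a, b, c} with edges a->b, b->c and c->a."""
    out = defaultdict(set)
    for s, _, o in triples:
        out[s].add(o)
    triangles = set()
    for a in list(out):
        for b in out[a]:
            for c in out.get(b, ()):
                if a in out.get(c, set()):
                    triangles.add(frozenset((a, b, c)))
    return triangles

edges = [("x", "p", "y"), ("y", "p", "z"), ("z", "p", "x"), ("x", "p", "w")]
print(find_triangles(edges))  # one triangle: {x, y, z}
```

Materializing such triangles as a dedicated index lets the optimizer answer a triangle-shaped basic graph pattern with a single lookup instead of two joins over permutation indexes.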
ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics
This paper presents an algorithmic family of dynamic topic models called
Aligned Neural Topic Models (ANTM), which combine novel data mining algorithms
to provide a modular framework for discovering evolving topics. ANTM maintains
the temporal continuity of evolving topics by extracting time-aware features
from documents using advanced pre-trained Large Language Models (LLMs) and
employing an overlapping sliding window algorithm for sequential document
clustering. This algorithm identifies a different
number of topics within each time frame and aligns semantically similar
document clusters across time periods. This process captures emerging and
fading trends across different periods and allows for a more interpretable
representation of evolving topics. Experiments on four distinct datasets show
that ANTM outperforms probabilistic dynamic topic models in terms of topic
coherence and diversity metrics. Moreover, it improves the scalability and
flexibility of dynamic topic models by being accessible and adaptable to
different types of algorithms. Additionally, a Python package is developed for
researchers and scientists who wish to study the trends and evolving patterns
of topics in large-scale textual data.
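The overlapping-sliding-window mechanism described above can be sketched in miniature. This is a toy illustration under our own assumptions, not the ANTM implementation: documents are grouped into overlapping time windows, clustered per window (clustering itself is elided here), and clusters in consecutive windows are aligned when they share documents from the overlap.

```python
# Toy sketch of overlapping windows and cluster alignment (names ours).

def windows(docs, size, step):
    """Yield overlapping slices of a time-ordered document list."""
    for start in range(0, max(len(docs) - size, 0) + 1, step):
        yield docs[start:start + size]

def align(clusters_t, clusters_t1, min_shared=1):
    """Link cluster i of window t to cluster j of window t+1 when they
    share at least `min_shared` documents from the window overlap."""
    links = []
    for i, a in enumerate(clusters_t):
        for j, b in enumerate(clusters_t1):
            if len(set(a) & set(b)) >= min_shared:
                links.append((i, j))
    return links

ws = list(windows(list(range(6)), size=4, step=2))
# ws == [[0, 1, 2, 3], [2, 3, 4, 5]] -- documents 2 and 3 sit in both
# windows, so clusters containing them can be chained across time.
```

Chaining these links across all windows yields the temporally continuous topic threads whose emergence and fading the paper analyzes.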
Leveraging Mediator Cost Models with Heterogeneous Data Sources
Projet RODIN
Distributed systems require declarative access to diverse sources of information. One approach to solving this heterogeneous distributed database problem is based on mediator architectures. In these architectures, mediators accept queries from users, process them with respect to wrappers, and return answers. Wrappers provide access to the underlying data sources. To process queries efficiently, the mediator must optimize the plan used for processing the query. In classical databases, cost-estimate-based query optimization is an effective optimization method. In heterogeneous distributed databases, cost-estimate-based query optimization is difficult to achieve because the underlying data sources do not export cost information. This paper describes a new method that permits the wrapper programmer to export cost estimates (cost-estimate formulas and statistics). Describing all cost estimates may be impossible for the wrapper programmer due to a lack of information, or burdensome due to the amount of information. We ease this responsibility by leveraging the generic cost model of the mediator with specific cost estimates from the wrappers. This paper describes the mediator architecture, the language for specifying cost estimates, the algorithm for blending cost estimates during query optimization, and experimental results based on a combination of analytical formulas and real measurements of an object database system.
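The blending idea admits a very small sketch. The code below is our own hedged illustration, not the paper's cost language: when a wrapper exports a cost formula for an operation, the optimizer uses it; otherwise it falls back to the mediator's generic cost model. Operation names, constants, and the dictionary-based interface are all assumptions made for the example.

```python
# Illustrative fallback from wrapper-specific to generic cost estimates.

GENERIC_COSTS = {
    "scan": lambda card: 1.0 * card,          # generic full-scan estimate
    "index_lookup": lambda card: 0.01 * card, # generic indexed-access estimate
}

def estimate_cost(op, cardinality, wrapper_costs=None):
    """Prefer the wrapper's exported formula; fall back to the generic model."""
    model = (wrapper_costs or {}).get(op) or GENERIC_COSTS[op]
    return model(cardinality)

# A wrapper that only knows its scans cost twice the generic estimate;
# its index lookups still use the mediator's generic formula.
wrapper_specific = {"scan": lambda card: 2.0 * card}
```

During plan enumeration, each candidate operator is costed this way, so a wrapper can refine exactly the estimates it knows without being forced to describe its entire cost behavior.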